Start-up

#rmarkdown::render(paste0("~/Documents/wgodwin28.github.io/impressions_model.Rmd"))
# clear memory
rm(list=ls())

#load libs
library(data.table)
library(ggplot2)
library(scales)
library(lubridate)
library(VGAM) #multinomial logistic
library(caret) #confusion matrix
library(randomForest) #random forest
library(C50) #boosted c5.0
library(tidyverse)

The goal is to build a model that predicts “impressions”, or the number of people the ad reached, using relevant advertisement covariates. I’ll use political advertisement spending data released by google. The dataset contains over 200,000 unique ads since June 2018 with relevant metadata including number of impressions as a ordinal categorical variable. Intially, I need to figure out useful predictors in the model by investigating the univariate distributions and basic bivariate relationships.

#Downloaded from here: https://transparencyreport.google.com/political-ads/region/US?hl=en
#read in data
dt <- fread("~/Desktop/google_ads/data/google-political-ads-creative-stats.csv")
#glimpse(dt)

#relevant variables
vars_keep <- c("Ad_Type", "Regions", "Advertiser_ID", "Date_Range_Start", "Date_Range_End", 
               "Num_of_Days", "Impressions", "Spend_Range_Min_USD", "Spend_Range_Max_USD")

#subset to variables of interest and create useful variables
dt <- dt %>%
  select(vars_keep) %>% #keep relevant variables
  mutate(startYear = substr(Date_Range_Start,1,4), #variable indicating year ad started
         Region = ifelse(grepl("EU", Regions), "EU", "US") %>% as.factor(), #create cleaner regions variable
         month_year = format(as.Date(Date_Range_Start), "%m-%Y"),
         week_start = floor_date(as.Date(Date_Range_Start), unit = "week"),
         week_end = floor_date(as.Date(Date_Range_End), unit = "week"),
         cost_cat = paste0(Spend_Range_Min_USD, "-", Spend_Range_Max_USD) %>% 
           factor(levels=c("0-100", "100-1000", "1000-50000", "50000-100000", "100000-NA"),
                  labels = c("0-100", "100-1k", "1k-50k", "50k-100k", "100k+")),
         Spend_Range_Min_USD = as.factor(Spend_Range_Min_USD),
         Ad_Type = as.factor(Ad_Type),
         Impressions = factor(Impressions, levels = c("≤ 10k", "10k-100k", "100k-1M", "1M-10M", "> 10M"),
                              labels = c("Under 10k", "10k-100k", "100k-1M", "1M-10M", "10M+")))

  #filter(regions=="US") #only include US data
dt <- dt %>%
  mutate(
    case_when(
      cost_cat=="100000-NA" ~ "100k+"
    )
  )

Load in the dataset and prep it for visualizing and modeling.

Data Exploration

Impressions

#impressions
ggplot(dt, aes(Impressions)) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = percent(..count../nrow(dt)), vjust = -0.2)) +
  theme_bw()

Most ads are classified under the “Under 10k” category, indicating an imbalanced classification problem. Due to this imbalance, I’ll plot impression counts on the log scale to observe differences at smaller frequecies.

Ad cost

One might presume that money spent for an ad would be an excellent predictor of impressions, given that that most tech platforms follow an advertising model of more money spent = more exposure. Google provides dollars spent as a categorical variable of 5 bins. Let’s see how well ad dollars spent and number of impressions correlate.

table(dt$Impressions, dt$cost_cat)
##            
##              0-100 100-1k 1k-50k 50k-100k  100k+
##   Under 10k 151413   9007    533        0      0
##   10k-100k   15999  14049   4147        1      0
##   100k-1M     2193   5591   7312       33      8
##   1M-10M        14    549   1857       58     23
##   10M+           0      0    239       27     27

Since “ad cost” and “impressions” variables are both ordinal, 5-bin variables, if one was to map the categories to each other on 1 to 1 basis, we could calculate accuracy of a model that solely used “ad cost” to predict “impressions”. And turns out, it would be correct 81.1% of the time! This tells us two things: money spent on ads is a primary driver of total impressions but it’s not the only driver. Maybe we can find other predictors in the data set that could improve our future model’s accuracy.

Impressions by region

#plot impression counts by region
ggplot(dt, aes(Impressions, fill=Region)) +
  geom_bar(position = "dodge") +
  scale_y_log10() +
  ylab("Count (log 10 scale)") +
  theme_bw()

Above is a bar chart showing counts of each impression category across the two regions: EU and US. Note that these frequencies were plotted on a log 10 scale in order to illustrate relative frequencies within the lower frequency categories (100k-1M impressions, etc). The EU has fewer ads within each category and appears to have a greater proportion of its ads with >10k impressions compared to the U.S., which indicates that this variable may be a useful covariate in the model.

Impressions by ad type

#plot impression counts by type of ad
ggplot(dt, aes(Impressions, fill=Ad_Type)) +
  geom_bar(position = "dodge") +
  scale_y_log10() +
  ylab("Count (log 10 scale)") +
  theme_bw()

This figure shows counts (on log 10 scale) of ads across impression category for each type of ad. “Image” ads remain the most popular type across impression category, while “Text” ads significantly decrease as number of impressions increase, relatively to other ad types.

Days ad aired

dt %>%
  mutate(days_running=difftime(Date_Range_End, Date_Range_Start, units = "days")) %>%
  ggplot(aes(Impressions, days_running)) +
    geom_boxplot() +
    ylab("Number of Days Aired") +
    theme_bw()

This figure shows box and whiskers of number of days ad was aired across impression category. While these data do look noisy, we see indication of a potential trend: as ad air time increases so does the number of impressions.

Based on the exploratory plots and tables, the covariates we’ll use to predict impressions are: cost of the ad, ad type (text, video, or image), region the ad aired (U.S. or E.U.), and number of days the ad aired.

Modeling

#model reponse and covariates
response <- "Impressions"
covariates <- c("Num_of_Days", "Region", "Ad_Type", "cost_cat")

#create training data frame
train <- sample(c(TRUE, FALSE), nrow(dt), rep=TRUE)
dt.train <- dt %>%
  select(response, covariates) %>%
  filter(train)

dt.test <- dt %>%
  select(response, covariates) %>%
  filter(!train)

#extract the testing response and predictors for prediction
x.test <- dt.test %>% select(covariates)
y <- dt.test %>% select(response)

#create standard formula object
mod.formula <- as.formula(paste(response, 
                            paste(covariates, collapse = "+"), sep = " ~ "))

We’ll model impressions using 3 different methods: logistic regression and two tree-based methods. Logistic is helpful as a first pass since it generally performs well and produces interpretable coefficients. The data will be ramdomly split into a training and testing set for evaluation of model performance.

Multinomial logistic regression

#Build the model
model1 <- vglm(formula = mod.formula, 
             data = dt.train, family = "multinomial")

#extract model summary
summary(model1)
## 
## Call:
## vglm(formula = mod.formula, family = "multinomial", data = dt.train)
## 
## Pearson residuals:
##                        Min        1Q    Median         3Q    Max
## log(mu[,1]/mu[,5]) -321.35  0.017986  0.149250  0.3068891  11.66
## log(mu[,2]/mu[,5]) -294.34 -0.268472 -0.209857 -0.0234441  28.93
## log(mu[,3]/mu[,5])  -61.13 -0.088700 -0.065751 -0.0088781 621.84
## log(mu[,4]/mu[,5])  -53.78 -0.006658 -0.003166 -0.0001966 203.97
## 
## Coefficients: 
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1       2.001e+01  5.678e+01      NA       NA    
## (Intercept):2       1.783e+01  5.678e+01   0.314 0.753486    
## (Intercept):3       1.639e+01  5.678e+01   0.289 0.772893    
## (Intercept):4       1.183e+01  5.678e+01   0.208 0.834978    
## Num_of_Days:1      -2.870e-02  1.395e-03 -20.570  < 2e-16 ***
## Num_of_Days:2      -1.119e-02  1.312e-03  -8.526  < 2e-16 ***
## Num_of_Days:3      -7.109e-03  1.251e-03  -5.684 1.32e-08 ***
## Num_of_Days:4      -3.746e-03  1.243e-03  -3.012 0.002591 ** 
## RegionUS:1          2.468e+00  2.291e-01  10.774  < 2e-16 ***
## RegionUS:2          3.014e+00  2.288e-01  13.172  < 2e-16 ***
## RegionUS:3          1.893e+00  2.254e-01   8.400  < 2e-16 ***
## RegionUS:4          6.959e-01  2.264e-01   3.074 0.002115 ** 
## Ad_TypeText:1       1.559e+01  8.886e-01  17.542  < 2e-16 ***
## Ad_TypeText:2       1.036e+01  8.863e-01  11.688  < 2e-16 ***
## Ad_TypeText:3       5.909e+00  8.842e-01   6.683 2.35e-11 ***
## Ad_TypeText:4       3.017e+00  8.616e-01   3.502 0.000462 ***
## Ad_TypeVideo:1      5.582e+00  2.068e-01  26.984  < 2e-16 ***
## Ad_TypeVideo:2      3.985e+00  2.038e-01  19.549  < 2e-16 ***
## Ad_TypeVideo:3      1.935e+00  1.978e-01   9.787  < 2e-16 ***
## Ad_TypeVideo:4      6.630e-01  2.023e-01   3.278 0.001046 ** 
## cost_cat100-1k:1   -5.828e+00  1.065e+02  -0.055 0.956346    
## cost_cat100-1k:2   -1.297e+00  1.065e+02  -0.012 0.990283    
## cost_cat100-1k:3    8.575e-01  1.065e+02   0.008 0.993573    
## cost_cat100-1k:4    3.943e+00  1.065e+02   0.037 0.970458    
## cost_cat1k-50k:1   -2.854e+01  5.678e+01  -0.503 0.615181    
## cost_cat1k-50k:2   -2.082e+01  5.678e+01  -0.367 0.713857    
## cost_cat1k-50k:3   -1.543e+01  5.678e+01  -0.272 0.785841    
## cost_cat1k-50k:4   -1.047e+01  5.678e+01  -0.184 0.853698    
## cost_cat50k-100k:1 -5.398e+01  5.773e+03      NA       NA    
## cost_cat50k-100k:2 -4.778e+01  5.313e+03      NA       NA    
## cost_cat50k-100k:3 -2.050e+01  5.678e+01  -0.361 0.718112    
## cost_cat50k-100k:4 -1.287e+01  5.678e+01  -0.227 0.820680    
## cost_cat100k+:1    -5.538e+01  7.978e+03      NA       NA    
## cost_cat100k+:2    -4.949e+01  7.469e+03      NA       NA    
## cost_cat100k+:3    -2.299e+01  5.679e+01  -0.405 0.685567    
## cost_cat100k+:4    -1.366e+01  5.678e+01  -0.241 0.809938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), 
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
## 
## Residual deviance: 90046.23 on 425752 degrees of freedom
## 
## Log-likelihood: -45023.11 on 425752 degrees of freedom
## 
## Number of Fisher scoring iterations: 21 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', 'cost_cat50k-100k:1', 'cost_cat50k-100k:2', 'cost_cat100k+:1', 'cost_cat100k+:2'
## 
## 
## Reference group is level  5  of the response
#Predict using the model
probability <- predict(model1, x.test, type="response")
dt.test <- dt.test %>%
  mutate(predicted_cat = apply(probability, 1, which.max),
         predicted_name = case_when(predicted_cat==1 ~ "Under 10k",
                                    predicted_cat==2 ~ "10k-100k",
                                    predicted_cat==3 ~ "100k-1M",
                                    predicted_cat==4 ~ "1M-10M",
                                    predicted_cat==5 ~ "10M+"),
         predicted_name = factor(predicted_name, 
                                 levels = c("Under 10k", "10k-100k", "100k-1M", "1M-10M", "10M+")))

#Accuracy of the model
mtab <- table(dt.test$predicted_name, dt.test$Impressions)
confusionMatrix(mtab)
## Confusion Matrix and Statistics
## 
##            
##             Under 10k 10k-100k 100k-1M 1M-10M  10M+
##   Under 10k     79387     8358    1110     12     0
##   10k-100k       1257     7403    2334    161     1
##   100k-1M           8     1252    4048    956   105
##   1M-10M            0        0      69    129    14
##   10M+              0        0       0      8    21
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8533          
##                  95% CI : (0.8511, 0.8554)
##     No Information Rate : 0.7564          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5793          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Under 10k Class: 10k-100k Class: 100k-1M
## Sensitivity                    0.9843         0.43514        0.53538
## Specificity                    0.6351         0.95812        0.97657
## Pos Pred Value                 0.8933         0.66359        0.63558
## Neg Pred Value                 0.9288         0.89935        0.96496
## Prevalence                     0.7564         0.15955        0.07091
## Detection Rate                 0.7445         0.06943        0.03796
## Detection Prevalence           0.8334         0.10462        0.05973
## Balanced Accuracy              0.8097         0.69663        0.75598
##                      Class: 1M-10M Class: 10M+
## Sensitivity               0.101896   0.1489362
## Specificity               0.999212   0.9999249
## Pos Pred Value            0.608491   0.7241379
## Neg Pred Value            0.989316   0.9988743
## Prevalence                0.011872   0.0013223
## Detection Rate            0.001210   0.0001969
## Detection Prevalence      0.001988   0.0002720
## Balanced Accuracy         0.550554   0.5744305

The confusion matrix shows the model predictions (row-wise) stacked against the actual data (column-wise). If the model fit the data perfectly, we’d only see values along the diagonal and would see zeros everywhere else. The overall accuracy is 85%, indicating that the model correctly labels the testing data 85% of the time. Sensitivity (we’ll use “recall”) and specificity varies across the impression categories, as we’d expect. When predicting “Under 10k” impressions, a recall of 0.98 indicates that the model get 98% of the actual “Under 10k” impressions correct. The relatively poor specificity indicates that the model correctly predicts that an ad will NOT get “Under 10k” impressions 63% of the time. We see the opposite result from the “10M+” impressions category. Because we have an imbalanced classification problem with so few “10M+” impression ads, the model can predict that an ad will not get “10M+” impressions with 99.99% confidence. However, recall of 0.13 reveals that 87% of actual “10M+” impression ads are incorrectly labeled by the model.

Model evaluation can be based on overall accuracy of the model or on more specific metrics like precision or recall, depending on the research question. For instance, consider that the goal of many political action committees (PAC) is to reach as many people as possible using as little money as possible. In order to figure out how to do this, they could start with looking at what covariates are conditionally associated with number of impressions from the logistic model coefficients. We can see that even controlling for ad cost, an ad with longer air time tends to achieve more impression. Going even further, they could build a model that optimizes for their main goal-accurately predicting ads with millions of impressions or recall. As we noted, the logistic model above has poor recall for impression categories of particular interest as it correctly labels a “1-10M” impression ad only 10% of the time and a “10M+” impression ad only 13% of the time. Compare this to our original “model, that simply used the cost of ad to predict impressions, which had a recall for”10M+" of 9%. Both models leave room for improvement. Let’s see if we can improve the recall at the high impression end using a tree-based models.

Random Forest

#Build the model
model2 <- randomForest(mod.formula, data = dt.train)

#Summarize the model
summary(model2)
##                 Length Class  Mode     
## call                 3 -none- call     
## type                 1 -none- character
## predicted       106447 factor numeric  
## err.rate          3000 -none- numeric  
## confusion           30 -none- numeric  
## votes           532235 matrix numeric  
## oob.times       106447 -none- numeric  
## classes              5 -none- character
## importance           4 -none- numeric  
## importanceSD         0 -none- NULL     
## localImportance      0 -none- NULL     
## proximity            0 -none- NULL     
## ntree                1 -none- numeric  
## mtry                 1 -none- numeric  
## forest              14 -none- list     
## y               106447 factor numeric  
## test                 0 -none- NULL     
## inbag                0 -none- NULL     
## terms                3 terms  call
#Predict using the model
dt.test$pred_randomforest <- predict(model2, x.test)

#Accuracy of the model
mtab2 <- table(dt.test$pred_randomforest, dt.test$Impressions)
confusionMatrix(mtab2)
## Confusion Matrix and Statistics
## 
##            
##             Under 10k 10k-100k 100k-1M 1M-10M  10M+
##   Under 10k     79596     8513    1136     12     0
##   10k-100k       1043     7006    1932    154     1
##   100k-1M          13     1492    4376    884    56
##   1M-10M            0        2     113    209    42
##   10M+              0        0       4      7    42
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8555          
##                  95% CI : (0.8534, 0.8576)
##     No Information Rate : 0.7564          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5839          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Under 10k Class: 10k-100k Class: 100k-1M
## Sensitivity                    0.9869         0.41180        0.57876
## Specificity                    0.6282         0.96507        0.97532
## Pos Pred Value                 0.8918         0.69120        0.64155
## Neg Pred Value                 0.9392         0.89630        0.96809
## Prevalence                     0.7564         0.15955        0.07091
## Detection Rate                 0.7464         0.06570        0.04104
## Detection Prevalence           0.8370         0.09506        0.06397
## Balanced Accuracy              0.8075         0.68844        0.77704
##                      Class: 1M-10M Class: 10M+
## Sensitivity               0.165087   0.2978723
## Specificity               0.998510   0.9998967
## Pos Pred Value            0.571038   0.7924528
## Neg Pred Value            0.990053   0.9990711
## Prevalence                0.011872   0.0013223
## Detection Rate            0.001960   0.0003939
## Detection Prevalence      0.003432   0.0004970
## Balanced Accuracy         0.581798   0.6488845

Compared to multinomial logistic regression, random forest does perform better for lower frequency classes like “1-10M” and “10M+”. Despite doubling the recall observed in logistic regression, a model that correctly labels an ad to get “10M+” impressions only 30% of the time leaves a lot to be desired. Outside of the scope of this analysis but in an effort to improve recall, we could oversample the minority impression categories. This has the effect of balancing out the distribution of impressions and would likely improve prediction in minority categories. The overall accuracy of the random forest model is similar to the logisitic regression at 85%, indicating that the gains made in recall for high impression classes may have come at cost for other components of model performance.

Boosted C5.0

#Build the model
model3 <- C5.0(mod.formula, data = dt.train, trials = 8)

#Predict using the model
dt.test$pred_c50 <- predict(model3, x.test)

#Accuracy of the model
mtab3 <- table(dt.test$pred_c50, dt.test$Impression)
confusionMatrix(mtab3)
## Confusion Matrix and Statistics
## 
##            
##             Under 10k 10k-100k 100k-1M 1M-10M  10M+
##   Under 10k     79563     8474    1129     12     0
##   10k-100k       1023     6051    1476     88     1
##   100k-1M          66     2488    4951   1116   116
##   1M-10M            0        0       5     50    24
##   10M+              0        0       0      0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8498          
##                  95% CI : (0.8476, 0.8519)
##     No Information Rate : 0.7564          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5692          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Under 10k Class: 10k-100k Class: 100k-1M
## Sensitivity                    0.9865         0.35567        0.65481
## Specificity                    0.6299         0.97112        0.96179
## Pos Pred Value                 0.8922         0.70043        0.56667
## Neg Pred Value                 0.9376         0.88814        0.97334
## Prevalence                     0.7564         0.15955        0.07091
## Detection Rate                 0.7461         0.05675        0.04643
## Detection Prevalence           0.8363         0.08102        0.08194
## Balanced Accuracy              0.8082         0.66340        0.80830
##                      Class: 1M-10M Class: 10M+
## Sensitivity              0.0394945    0.000000
## Specificity              0.9997248    1.000000
## Pos Pred Value           0.6329114         NaN
## Neg Pred Value           0.9885879    0.998678
## Prevalence               0.0118725    0.001322
## Detection Rate           0.0004689    0.000000
## Detection Prevalence     0.0007409    0.000000
## Balanced Accuracy        0.5196096    0.500000

A Boosted C5.0 model is based on simple tree-based framework that uses “boosting” methods. While a random forest splits the predictor space on into partitions that minimize impurity/maximize information criterion for each independent tree, boosting models grow trees sequentially with the residuals of the previous tree becoming the response variable of the subsequent tree. While this smoothing over residuals may sometimes improve model performance, in this context, the random forest performed slightly better overall.

There are a variety of other models one could use to classify impressions from naive Bayes to support vector machines, which could lead to improved overall accuracy and improved recall. There’s also feature engineering that we didn’t investigate at length (like lagged variables or midterm-related associations). Those pursuits are fodder for future projects. The takeaway from this analysis is that logistic regression, while some times not as accurate, still can construct a useful springboard for further analyses due to its interpretability. And regression trees-random forests and boosting methods-can be fast, flexible frameworks for optimizing toward a specific performance metric.